Train a sklearn.ensemble.RandomForestClassifier
that given a soccer player description outputs his skin color.
Once you have assessed your model, inspect the feature_importances_ attribute and discuss the obtained results.
First we just look at the Random Forest classifier without any parameters (using the defaults) -> gives very good scores.
Look a bit at the feature_importances
Then we see that it is better to aggregate the data by player (We can't show overfitting with 'flawed' data and very good scores, so we first aggregate)
Load the data aggregated by player
Look again at the classifier with default parameters
Show the effect of some parameters to overfitting and use that to...
...find acceptable parameters
Inspect the feature_importances and discuss the results
At the end we look very briefly at other classifiers.
Note that we use the values 1, 2, 3, 4, 5 and the labels WW, W, N, B, BB interchangeably for the skin color categories of the players.
In [1]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import show
import itertools
# sklearn
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn import preprocessing as pp
from sklearn.model_selection import KFold , cross_val_score, train_test_split, validation_curve
from sklearn.metrics import make_scorer, roc_curve, roc_auc_score, accuracy_score, confusion_matrix
from sklearn.model_selection import learning_curve
import sklearn.preprocessing as preprocessing
%matplotlib inline
sns.set_context('notebook')
pd.options.mode.chained_assignment = None # default='warn'
pd.set_option('display.max_columns', 500) # to see all columns
Load the preprocessed data and look at it. We preprocess the data in the HW01-1-Preprocessing notebook. The data is already encoded to be used for the RandomForestClassifier.
In [2]:
data = pd.read_csv('CrowdstormingDataJuly1st_preprocessed_encoded.csv', index_col=0)
data_total = data.copy()
print('Shape of the dyad table (rows, columns):', data.shape)
data.head()
Out[2]:
In [3]:
print('Number of dyads: ', len(data))
print('Number of players: ', len(data.playerShort.unique()))
print('Number of referees: ', len(data.refNum.unique()))
First we just train and test on the preprocessed data with the default values of the Random Forest to see what happens. For this first model, we use all the features to predict the target (color_rating), and then we observe which features are the most important.
In [4]:
player_colors = data['color_rating']
rf_input_data = data.drop(['color_rating'], axis=1)
player_colors.head() # values 1 to 5
Out[4]:
In [5]:
rf = RandomForestClassifier()
cross_val_score(rf, rf_input_data, player_colors, cv=10, n_jobs=3, pre_dispatch='n_jobs+1', verbose=1)
Out[5]:
Quite good results...
In [6]:
def show_important_features_random_forest(X, y, rf=None):
    if rf is None:
        rf = RandomForestClassifier()
    # train the forest
    rf.fit(X, y)
    # find the feature importances
    importances = rf.feature_importances_
    # spread of each importance across the individual trees
    std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)
    # feature indices sorted from most to least important
    indices = np.argsort(importances)[::-1]
    # print the feature ranking
    cols = X.columns
    print("Feature ranking:")
    for f in range(X.shape[1]):
        print("%d. feature n° %d %s (%f)" % (f + 1, indices[f], cols[indices[f]], importances[indices[f]]))
    # Plot the feature importances of the forest
    plt.figure()
    plt.title("Feature importances")
    plt.bar(range(X.shape[1]), importances[indices],
            color="r", yerr=std[indices], align="center")
    plt.xticks(range(X.shape[1]), indices)
    plt.xlim([-1, X.shape[1]])
    plt.show()
In [7]:
show_important_features_random_forest(rf_input_data, player_colors)
We can see that the most important features are:
- photoID
- player
- the birthday
- playerShort
The obtained result is weird. From personal experience, those 4 features should be independent of the skin color, and they should also be unique to one player. photoID is the id of the photo and thus unique to one player and independent of the skin color. The same holds for 'player' and 'playerShort' (both represent the player's name). Birthday is not necessarily unique, but should not matter much for the skin color since people all over the world are born all the time.
We have to remember that our data contains dyads between player and referee, so a player can appear several times in our data. This could be the reason why the features that are unique to a player are important. Let's look at the data:
In [8]:
data.playerShort.value_counts()[:10]
Out[8]:
Indeed, some players appear around 200 times, so it is easy to determine the skin color of the player djibril cisse if he appears both in the training set and in the test set. But in reality the probability of having 2 djibril cisse with the same birthday and the same skin color is almost zero. These attributes are so important because some rows of one player end up in the training set and others in the test set, so the classifier can use them to look up the skin color.
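To make the leak concrete, here is a small sketch on synthetic data (not the crowdstorming data; the sizes, labels and features are made up for illustration). It compares a plain shuffled K-fold with GroupKFold, which keeps all rows of one player in the same fold. GroupKFold is an alternative check that we do not use in the rest of this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.RandomState(0)

# Synthetic stand-in for the dyad table: 40 players with 25 rows each.
n_players, rows_per_player = 40, 25
groups = np.repeat(np.arange(n_players), rows_per_player)  # player id of each row
player_label = rng.randint(0, 2, size=n_players)           # one random label per player
y = player_label[groups]

# Features: the player id itself (plays the role of photoID) plus pure noise.
X = np.column_stack([groups, rng.normal(size=(len(groups), 3))])

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Shuffled K-fold: rows of the same player land in both train and test folds.
leaky = cross_val_score(clf, X, y,
                        cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Group-aware K-fold: a player is either entirely in train or entirely in test.
grouped = cross_val_score(clf, X, y, groups=groups, cv=GroupKFold(n_splits=5))

print("shuffled KFold accuracy:", leaky.mean())
print("GroupKFold accuracy:    ", grouped.mean())
```

Since the labels are random per player, any score well above chance in the shuffled split can only come from recognising players that were already seen during training.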
So we drop those attributes and see what happens.
In [9]:
rf_input_data_drop = rf_input_data.drop(['birthday', 'player','playerShort', 'photoID'], axis=1)
In [10]:
rf = RandomForestClassifier()
result = cross_val_score(rf, rf_input_data_drop, player_colors, cv=10, n_jobs=3, pre_dispatch='n_jobs+1', verbose=1)
result
Out[10]:
The accuracy of the classifier dropped a bit, which is no surprise.
In [11]:
show_important_features_random_forest(rf_input_data_drop, player_colors)
That makes more sense: it is possible that dark-skinned players are statistically taller than white players, but the club and position should not be that important. So we decided to aggregate on the player's name to have only one row with the personal information of each player.
We do the aggregation in the HW04-1-Preprocessing notebook.
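The actual aggregation lives in the preprocessing notebook and is not reproduced here. As a rough idea of what "one row per player" means, the following is a minimal sketch on a made-up table; the columns games and height and the chosen aggregation functions are assumptions for illustration only.

```python
import pandas as pd

# Tiny made-up dyad table: one row per (player, referee) pair.
dyads = pd.DataFrame({
    "playerShort": ["a", "a", "b", "b", "b"],
    "refNum": [1, 2, 1, 3, 4],
    "games": [2, 1, 4, 1, 1],               # hypothetical per-dyad count
    "height": [180, 180, 175, 175, 175],    # hypothetical per-player constant
    "color_rating": [2, 2, 4, 4, 4],
})

# Collapse the dyads to one row per player:
# counts are summed, per-player constants are taken once.
players = dyads.groupby("playerShort").agg(
    n_referees=("refNum", "nunique"),
    games=("games", "sum"),
    height=("height", "first"),
    color_rating=("color_rating", "first"),
).reset_index()

print(players)
```

After such a step every player appears exactly once, so no player can be in the training and the test set at the same time.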
Load the aggregated data.
In [12]:
data_aggregated = pd.read_csv('CrowdstormingDataJuly1st_aggregated_encoded.csv')
data_aggregated.head()
Out[12]:
Drop the player-specific features: since they are unique to each player, they cannot be useful for classification.
In [13]:
data_aggregated = data_aggregated.drop(['playerShort', 'player', 'birthday'], axis=1)
Train the default classifier on the new data and look at the important features.
In [14]:
rf = RandomForestClassifier()
aggr_rf_input_data = data_aggregated.drop(['color_rating'], axis=1)
aggr_player_colors = data_aggregated['color_rating']
result = cross_val_score(rf, aggr_rf_input_data, aggr_player_colors,
cv=10, n_jobs=3, pre_dispatch='n_jobs+1', verbose=1)
print("mean result: ", np.mean(result))
result
Out[14]:
The results are not very impressive...
In [15]:
show_important_features_random_forest(aggr_rf_input_data, aggr_player_colors)
That makes a lot more sense. The importances are much more evenly spread and several IAT and EXP features are on top.
But before going into more detail, we address the overfitting issue mentioned in the assignment.
The classifier overfits when the training accuracy is much higher than the testing accuracy (the classifier fits the training data too closely and thus generalizes badly). So we look at the different parameters and discuss how they contribute to the overfitting issue.
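As a minimal illustration of that definition, the sketch below measures the train and test accuracy of two forests on synthetic data (generated with make_classification, not our player data). The helper train_test_scores and the parameter values are chosen only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Noisy synthetic problem: few informative features, many flipped labels.
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           n_redundant=0, flip_y=0.3, random_state=0)

def train_test_scores(clf, X, y, cv=5):
    """Return the mean train and mean test accuracy over the CV folds."""
    scores = cross_validate(clf, X, y, cv=cv, return_train_score=True)
    return scores["train_score"].mean(), scores["test_score"].mean()

# Fully grown trees (default) versus trees with large leaves.
deep = RandomForestClassifier(n_estimators=50, random_state=0)
shallow = RandomForestClassifier(n_estimators=50, min_samples_leaf=30,
                                 random_state=0)

deep_train, deep_test = train_test_scores(deep, X, y)
shallow_train, shallow_test = train_test_scores(shallow, X, y)

print("default forest:      train %.2f  test %.2f" % (deep_train, deep_test))
print("min_samples_leaf=30: train %.2f  test %.2f" % (shallow_train, shallow_test))
```

The difference between the two columns is the quantity we watch in the plots: a large gap means the forest memorises the training folds.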
To show the impact of each parameter we try different values and plot the train vs test accuracy. Luckily there is a function for this :D
In [16]:
# does the validation with cross validation
def val_curve_rf(input_data, y, param_name, param_range, cv=5, rf=RandomForestClassifier()):
return validation_curve(rf, input_data, y, param_name, param_range, n_jobs=10,verbose=0, cv=cv)
# defines the parameters and the ranges to try
def val_curve_all_params(input_data, y, rf=RandomForestClassifier()):
params = {
'class_weight': ['balanced', 'balanced_subsample', None],
'criterion': ['gini', 'entropy'],
'n_estimators': [1, 10, 100, 500, 1000, 2000],
'max_depth': list(range(1, 100, 5)),
'min_samples_split': [0.001,0.002,0.004,0.005, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.8, 0.9],
'min_samples_leaf': list(range(1, 200, 5)),
'max_leaf_nodes': [2, 50, 100, 200, 300, 400, 500, 1000]
}
RandomForestClassifier
# does the validation for all parameters from above
for p, r in params.items():
train_scores, valid_scores = val_curve_rf(input_data, y, p, r, rf=rf)
plot_te_tr_curve(train_scores, valid_scores, p, r)
def plot_te_tr_curve(train_scores, valid_scores, param_name, param_range, ylim=None):
"""
Generate the plot of the test and training(validation) accuracy curve.
"""
plt.figure()
if ylim is not None:
plt.ylim(*ylim)
plt.grid()
# if the parameter values are strings
if isinstance(param_range[0], str):
plt.subplot(1, 2, 1)
plt.title(param_name+" train")
plt.boxplot(train_scores.T, labels=param_range)
plt.subplot(1, 2, 2)
plt.title(param_name+" test")
plt.boxplot(valid_scores.T, labels=param_range)
# parameter names are not strings (are numeric)
else:
plt.title(param_name)
plt.ylabel("accuracy")
plt.xlabel("value")
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(valid_scores, axis=1)
test_scores_std = np.std(valid_scores, axis=1)
plt.fill_between(param_range, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(param_range, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(param_range, train_scores_mean, '-', color="r",
label="Training score")
plt.plot(param_range, test_scores_mean, '-', color="g",
label="Testing score")
plt.legend(loc="best")
return plt
In [17]:
val_curve_all_params(aggr_rf_input_data, aggr_player_colors, rf)
n_estimators: How many trees are used. As expected, more trees improve both the train and the test accuracy, but the test accuracy plateaus, so it does not really make sense to use more than 500 trees (more trees also mean more computation time). The gap between the two curves also widens with more trees: the train accuracy goes almost to 1 while the test accuracy stays around 0.42.
min_samples_leaf: The minimum number of samples required to be at a leaf node. The higher this value, the less overfitting. It effectively limits how well a tree can fit a given training set.
criterion: The function to measure the quality of a split. You can see that 'entropy' scores higher in the test. So we take it even though gini has a much lower variance.
max_depth: The maximal depth of a tree. The higher the value, the more the tree overfits. It seems that no tree grows deeper than about 10 levels, so we won't limit it.
max_leaf_nodes: An upper limit on how many leaves a tree can have. The train accuracy grows until about 400, after which more leaf nodes bring no gain, probably because the trees do not grow that many leaves anyway.
min_samples_split: The minimum number of samples required to split an internal node. Has a similar effect and behaviour as min_samples_leaf.
class_weight: Weights associated with classes. Gives more weight to classes with fewer members. It does not seem to have a big influence. Note that the third option is None, which sets the weight of every class to 1.
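To see what 'balanced' does in numbers: scikit-learn weights each class by n_samples / (n_classes * class_count). The sketch below checks this on a tiny made-up label vector using compute_class_weight; the labels are illustrative and not taken from our data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 8 samples spread over 4 classes.
y = np.array([1, 1, 2, 2, 2, 2, 3, 4])
classes = np.unique(y)

# Weights as scikit-learn computes them for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same formula by hand: n_samples / (n_classes * count of each class).
manual = len(y) / (len(classes) * np.bincount(y)[classes])

for c, w in zip(classes, weights):
    print("class %d: weight %.2f" % (c, w))
```

The rare classes 3 and 4 get weight 2.0, while the largest class 2 gets 0.5.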
The default classifier achieves about 40% accuracy. This is not much considering that about 40% of players are in category 2. This classifier is not better than classifying all players into category 2. So we are going to find better parameters for the classifier.
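Such a majority-class baseline can be scored with the same cross-validation tools via DummyClassifier. The following is a sketch on made-up labels: the 40% share of category 2 and the 35% share of category 1 follow the proportions quoted in this notebook, while the split of the remaining 25% is an assumption.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# 100 made-up labels: 35% category 1, 40% category 2, the rest is assumed.
y = np.repeat([1, 2, 3, 4, 5], [35, 40, 10, 10, 5])
# The features are irrelevant for this baseline; zeros are enough.
X = np.zeros((len(y), 1))

# Always predicts the most frequent class seen during training.
baseline_clf = DummyClassifier(strategy="most_frequent")
baseline = cross_val_score(baseline_clf, X, y, cv=5)

print("baseline accuracy per fold:", baseline)
print("mean baseline accuracy:", baseline.mean())
```

Any real classifier should be compared against this number rather than against 0.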
Based on the plots above and trial and error, we find good parameters for the RandomForestClassifier and look if feature importance changed.
In [18]:
rf_good = RandomForestClassifier(n_estimators=500,
max_depth=None,
criterion='entropy',
min_samples_leaf=2,
min_samples_split=5,
class_weight='balanced_subsample')
aggr_rf_input_data = data_aggregated.drop(['color_rating'], axis=1)
aggr_player_colors = data_aggregated['color_rating']
result = cross_val_score(rf_good, aggr_rf_input_data, aggr_player_colors,
cv=10, n_jobs=3, pre_dispatch='n_jobs+1', verbose=1)
print("mean result: ", np.mean(result))
result
Out[18]:
In [19]:
show_important_features_random_forest(aggr_rf_input_data, aggr_player_colors, rf=rf_good)
We can see that the accuracy is only a bit better. But the most important features are even more balanced. The confidence intervals are huge and almost any feature could be on top. More importantly, the IAT and EXP features seem to play some role in gaining those 4% of accuracy. But clearly we can't say that there is a big difference between players of different skin colors.
Now we look at the confusion matrix to see what the classifier actually does. We split the data into a training and a testing set (test set = 25%) and then train our random forest using the parameters selected above:
In [32]:
x_train, x_test, y_train, y_test = train_test_split(aggr_rf_input_data, aggr_player_colors, test_size=0.25)
rf_good.fit(x_train, y_train)
prediction = rf_good.predict(x_test)
accuracy = accuracy_score(y_test, prediction)
print('Accuracy: ',accuracy)
In [33]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # scale each row so that the entries are fractions of the true class
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    fmt = '.2f' if normalize else 'd'
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    # switch the text color on dark cells so the numbers stay readable
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

cm = confusion_matrix(y_test, prediction)
class_names = ['WW', 'W', 'N', 'B', 'BB']
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')
Our model predicts almost only 2 categories instead of 5: mostly WW or W. This is because the data is imbalanced, and the class balancing apparently did not help much. Looking at the true labels in the matrix above, there is clearly a majority of white players. Let's have a look at the exact distribution.
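One way to quantify "predicts almost only 2 categories" is the per-class recall, i.e. the diagonal of the confusion matrix divided by the row sums. The sketch below uses a small made-up set of true and predicted labels, not our actual predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Made-up true labels and predictions that collapse onto classes 1 and 2.
y_true = np.array([1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 5])
y_pred = np.array([1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2])
labels = [1, 2, 3, 4, 5]

cm = confusion_matrix(y_true, y_pred, labels=labels)

# Recall per class: correctly predicted samples / all samples of that class.
recall = cm.diagonal() / cm.sum(axis=1)

accuracy = accuracy_score(y_true, y_pred)
# Balanced accuracy is the plain average of the per-class recalls.
balanced = balanced_accuracy_score(y_true, y_pred)

for name, r in zip(['WW', 'W', 'N', 'B', 'BB'], recall):
    print("recall %-2s: %.2f" % (name, r))
print("accuracy: %.2f, balanced accuracy: %.2f" % (accuracy, balanced))
```

In this toy case the accuracy of about 0.43 hides the fact that three of the five classes are never predicted correctly, which the balanced accuracy of 0.26 makes visible.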
In [22]:
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
ax[0].hist(aggr_player_colors)
ax[1].hist(aggr_player_colors, bins=3)
Out[22]:
Those 2 histograms show the imbalance in the data. Indeed, the first 2 categories represent more than 50% of the data. Let's look at the numbers.
In [23]:
print('Proportion of WW: {:.2f}%'.format(
100*aggr_player_colors[aggr_player_colors == 1].count()/aggr_player_colors.count()))
print('Proportion of W: {:.2f}%'.format(
100*aggr_player_colors[aggr_player_colors == 2].count()/aggr_player_colors.count()))
print('Proportion of N: {:.2f}%'.format(
100*aggr_player_colors[aggr_player_colors == 3].count()/aggr_player_colors.count()))
print('Proportion of B: {:.2f}%'.format(
100*aggr_player_colors[aggr_player_colors == 4].count()/aggr_player_colors.count()))
print('Proportion of BB: {:.2f}%'.format(
100*aggr_player_colors[aggr_player_colors == 5].count()/aggr_player_colors.count()))
WW and W represent 75% of the data.
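The same proportions can also be obtained in one step with value_counts(normalize=True). This is a sketch on a made-up series of ratings; the values are illustrative only.

```python
import pandas as pd

# Made-up ratings with the categories 1..5.
ratings = pd.Series([1, 1, 1, 2, 2, 2, 2, 3, 4, 5])
names = {1: 'WW', 2: 'W', 3: 'N', 4: 'B', 5: 'BB'}

# Share of each category in percent, ordered from WW to BB.
proportions = (ratings.value_counts(normalize=True)
                      .sort_index()
                      .rename(index=names) * 100)

for name, pct in proportions.items():
    print('Proportion of {}: {:.2f}%'.format(name, pct))
```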
Now assume a new classifier that always classifies into the W category. This classifier has an accuracy of 40%. It means that our classifier is not much better than always classifying a player as W... What happens when we do a ternary and a binary classification?
For ternary we put WW and W in one class, N in the second and B and BB in the last (the classes are then WWW, N and BBB).
For binary we merge N with the BBB class -> WWW vs NBBB.
In [24]:
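# ternary: 1 = WWW (WW, W), 2 = N, 3 = BBB (B, BB)
# (max(x, 2) left the ratings 3, 4, 5 unchanged and produced four classes)
player_colors_3 = aggr_player_colors.map(lambda x: 1 if x <= 2 else (2 if x == 3 else 3))
# binary: 1 = WWW, 2 = NBBB
player_colors_2 = player_colors_3.map(lambda x: min(x, 2))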
In [25]:
result3 = cross_val_score(rf_good, aggr_rf_input_data, player_colors_3,
cv=10, n_jobs=3, pre_dispatch='n_jobs+1', verbose=1)
result2 = cross_val_score(rf_good, aggr_rf_input_data, player_colors_2,
cv=10, n_jobs=3, pre_dispatch='n_jobs+1', verbose=1)
In [26]:
print('Proportion of WWW: {:.2f}%'.format(
100*player_colors_2[player_colors_2 == 1].count()/player_colors_2.count()))
print('Proportion of NBBB: {:.2f}%'.format(
100*player_colors_2[player_colors_2 == 2].count()/player_colors_2.count()))
In [27]:
print("mean res3: ", np.mean(result3))
print("mean res2: ", np.mean(result2))
We see that our classifier is only a little bit better than the 'stupid' one. The difference between the ternary and binary classification is also small.
Confusion Matrix of the binary classifier:
In [28]:
x_train, x_test, y_train, y_test = train_test_split(aggr_rf_input_data, player_colors_2, test_size=0.25)
rf_good.fit(x_train, y_train)
prediction = rf_good.predict(x_test)
accuracy = accuracy_score(y_test, prediction)
cm = confusion_matrix(y_test, prediction)
class_names = ['WWW', 'NBBB']
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')
Even for the 2-class problem it is hard to predict the colors, and the classifier still mostly predicts WWW. From these results we might conclude that there is just not enough difference between the 'black' and 'white' players to classify them.
In [29]:
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from sklearn.ensemble import AdaBoostClassifier
In [30]:
def make_print_confusion_matrix(clf, clf_name):
    # a single 75/25 split is used for the confusion matrix
    x_train, x_test, y_train, y_test = train_test_split(aggr_rf_input_data, player_colors_2, test_size=0.25)
    clf.fit(x_train, y_train)
    prediction = clf.predict(x_test)
    # the reported accuracy comes from 5-fold cross validation on the full data
    accuracy = np.mean(cross_val_score(clf, aggr_rf_input_data, player_colors_2, cv=5, n_jobs=3, pre_dispatch='n_jobs+1', verbose=1))
    print(clf_name + ' Accuracy: ', accuracy)
    cm = confusion_matrix(y_test, prediction)
    class_names = ['WWW', 'NBBB']
    plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix of '+clf_name)
    plt.show()
Only the AdaBoostClassifier is slightly better than our random forest, probably because it uses our rf_good random forest as base estimator and combines the results smartly. That might explain the extra 1%.
For the MLP classifier we just tried a few architectures; there might be better ones...
Note that the accuracy score is the result of 5-fold cross validation.
In [31]:
make_print_confusion_matrix(svm.SVC(kernel='rbf', degree=3, class_weight='balanced'), "SVC")
# note: scikit-learn 1.2 and later name this argument `estimator` instead of `base_estimator`
make_print_confusion_matrix(AdaBoostClassifier(n_estimators=500, base_estimator=rf_good), "AdaBoostClassifier")
make_print_confusion_matrix(MLPClassifier(activation='tanh', learning_rate='adaptive',
solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(100, 100, 50, 50, 2), random_state=1),
"MLPclassifier")
make_print_confusion_matrix(GaussianNB(), "GaussianNB")